Abstract

In machine learning and computer vision, input images are often filtered to increase data
discriminability. In some situations, however, one may wish to purposely decrease discriminability
of one classification task (a "distractor" task), while simultaneously preserving information
relevant to another (the task-of-interest). For example, it may be important to mask the identity
of persons contained in face images before submitting them to a crowdsourcing site (e.g.,
Mechanical Turk) when labeling them for certain facial attributes. Another example is
inter-dataset generalization: when training on a dataset with a particular covariance structure
among multiple attributes, it may be useful to suppress one attribute while preserving another so
that a trained classifier does not learn spurious correlations between attributes. In this paper
we present an algorithm that finds optimal filters to give high discriminability to one task while
simultaneously giving low discriminability to a distractor task. We present results showing the
effectiveness of the proposed technique on both simulated data and natural face images.



1 Introduction 

In machine learning and computer vision, images are commonly filtered prior to classification to 
enhance class discriminability. Such filters may consist of manually constructed filters (e.g.,
low-pass, band-pass filters) or may be learned directly from the data (e.g., using Deep Belief
Networks or Independent Components Analysis [1]). However, there also exist scenarios in which it may
be useful to intentionally decrease discriminability for one classification task (a "distractor" task), 
while enhancing or at least preserving discriminability for another task (the task-of-interest). Dis- 
criminability can pertain to perception by humans, or analysis by a machine classifier. Two scenarios 
where such filtering is useful include (1) preservation of privacy during data labeling, and (2) gen- 
eralization to datasets with different correlation structure. 

(1) Preservation of privacy: Machine learning is increasingly making use of crowdsourcing services 
such as the Amazon Mechanical Turk, in which not all labelers can be trusted. In some situations, 
the data to be labeled may contain sensitive information that should not be released to the public, 
e.g., the identity of people's faces or the geographical locations of satellite images. It may be useful 
to first filter the images before uploading them to the Mechanical Turk so that identity/location is 
removed, but so that the task-of-interest remains highly discriminable. For the case of facial identity 
removal, this process is known as face de-identification [9].

(2) Generalization to datasets with different correlation structure: In some training datasets there
exist strong correlations between different attributes that can impair generalization performance
on data with different covariance structure (covariate shift). Consider, for example, a classifier,
intended to recognize some attribute A, that is trained on a dataset in which there is a strong
correlation between attributes A and B. Such a classifier may perform very badly when tested on a
different dataset in which the correlation between A and B is low or perhaps negative. It may be
useful, when training a classifier for A, to first filter the training data to preserve
discriminability of A, while suppressing discriminability of B, so that the spurious correlation
between A and B is not learned.

Figure 1: A minimal example in $\mathbb{R}^2$ showing (left) unfiltered data, data filtered to
preserve Task A's and suppress Task B's discriminability (center), and data filtered to suppress
Task A's and preserve Task B's discriminability (right).

In this paper, we present a novel algorithm for learning an image filter (parameterized by $\theta$)
from labeled data that simultaneously preserves discriminability of the task-of-interest while
suppressing discriminability of the distractor task. In this sense, the filter "discriminately
decreases discriminability" of the images. In the experiments in the paper, we focus on image
filters, but in fact the data can be of any dimensional representation. We focus on discriminating
binary attributes, but as shown in Section 6, suppression of binary gender discrimination also
significantly reduces face identifiability. Before presenting our algorithm in Section 3, we first
provide a simple example of "discriminately decreasing discriminability" in Section 2. The rest of
the paper consists of experimental results.



2 Simple example in $\mathbb{R}^2$

Consider the set of 28 data points $\{x_i\}$ (in $\mathbb{R}^2$) shown in Figure 1 (left). Each point
$x_i$ is given binary labels for two labeling tasks. Points labeled 0 for Task A are shown in
magenta, while points labeled 1 for Task A are black. Points labeled 0 for Task B are marked as
crosses, while points labeled 1 are shown as circles. In their unfiltered original form, both tasks
are easily discriminated, as illustrated in Figure 1.

Suppose now that we filter the data using $\theta_1$ (in this case, a general linear transformation),
as shown in the center part of the figure: Task A (color) is highly discriminable, while Task B
(marker) is not, since the two marker styles (circles and crosses) appear to overlap. Similarly, we
can use $\theta_2$ to suppress discriminability of Task A and preserve discriminability of Task B, in
which case we arrive at the filtered points shown in Figure 1 (right). The goal of the algorithm in
this paper is to learn such linear transformations (filters) automatically.



3 Algorithm: Learning a filter to discriminately decrease discriminability 



The proposed method requires quantifying data discriminability as $J^*$, as described in the next
subsection. The key is that $J^*$ can be found analytically as a function of its input. Using $J^*$,
we pose an optimization problem to maximize the ratio of discriminabilities of Tasks A and B w.r.t.
the filter $\theta$ which transforms the data.
3.1 Quantifying discriminability as J*



The measure of discriminability we use is the ratio of between-class variance to within-class
variance, first proposed by Fisher [4] and used in Fisher's Linear Discriminant analysis.

Let $\{x_i\}$ represent a set of data (column vectors) to be classified, where each
$x_i \in \mathbb{R}^d$, and let $y_i \in \{0, 1\}$ represent the binary class label for each $x_i$
for some labeling task. In our setting, each $x_i$ might be a face image with $d$ pixels, and each
$y_i$ might represent, for example, whether or not the person in image $i$ is smiling. One useful
measure of discriminability of the data w.r.t. the class labels is Fisher's discriminability
criterion, $J$, which measures the ratio of between-class variance $B$ to the within-class variance
$W$ after projecting the $\{x_i\}$ onto some direction $p \in \mathbb{R}^d$. Depending on the choice
of $p$, the data $\{x_i\}$ may become more or less discriminable w.r.t. the class labels $\{y_i\}$.
When Fisher's linear discriminant is used for actual classification, $p$ represents the normal
vector to the separating hyperplane of the two classes.

Notation: Let $N_0$ denote the number of data vectors $\{x_i\}$ such that $y_i = 0$, and let $X_0$
denote the $d \times N_0$ matrix formed from the $N_0$ data points in class 0. We can define $N_1$
and $X_1$ analogously. We write the mean data vector for class 0 as $\bar{x}_0$, i.e.,
$\bar{x}_0 = \frac{1}{N_0} \sum_{i : y_i = 0} x_i$; define $\bar{x}_1$ analogously. $\bar{x}$ is the
mean over all $\{x_i\}$. Finally, define $\bar{X}_0$ (or $\bar{X}_1$) as the $d \times N_0$ (or
$d \times N_1$) matrix containing $N_0$ (or $N_1$) copies of $\bar{x}_0$ (or $\bar{x}_1$).

Given the notation above, Fisher's linear discriminability can be computed as

$$J(X_1, X_0, p) = \frac{p^\top B p}{p^\top W p} \tag{1}$$

where the between-class variance is defined as
$B = (\bar{x}_1 - \bar{x}_0)(\bar{x}_1 - \bar{x}_0)^\top$ and the within-class variance is defined as
$W = (X_1 - \bar{X}_1)(X_1 - \bar{X}_1)^\top + (X_0 - \bar{X}_0)(X_0 - \bar{X}_0)^\top$. $W$ can be
regularized as $W_{reg} = \alpha I + (1 - \alpha) W$ with regularization parameter
$\alpha \in [0, 1]$. For the remainder of the paper we refer to $W_{reg}$ simply as $W$.

One advantage of Fisher's linear discriminant over other classification methods (e.g., support
vector machines, multivariate logistic regression) is that the optimal $p^*$ that maximizes the
discriminability of the $X_1$ from the $X_0$ can be found analytically [3]:

$$p^*(X_1, X_0) = \arg\max_p J(X_1, X_0, p) = W^{-1}(\bar{x}_1 - \bar{x}_0) \tag{2}$$

Given this solution for $p^*$, we can define the "Fisher maximal discriminability" of $X_1$ and
$X_0$ as:

$$J^*(X_1, X_0) = J(X_1, X_0, p^*(X_1, X_0)) \tag{3}$$
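
As a concrete illustration of Equations 1-3, the following sketch computes the regularized within-class matrix, the direction p*, and J* for two small sets of 2-D points. It is a minimal pure-Python sketch, not the authors' implementation: the function names (`mean`, `scatter`, `fisher_j_star`), the restriction to d = 2 (so that W can be inverted in closed form), and the toy data are all assumptions made for illustration.

```python
# Sketch: Fisher maximal discriminability J* for 2-D data (d = 2), pure Python.
# alpha is the regularization weight on W; all names here are illustrative.

def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def scatter(points, mu):
    # Sum over points of (x - mu)(x - mu)^T, returned as a 2x2 nested list.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_j_star(x1, x0, alpha=0.1):
    m1, m0 = mean(x1), mean(x0)
    delta = [m1[0] - m0[0], m1[1] - m0[1]]
    s1, s0 = scatter(x1, m1), scatter(x0, m0)
    # Regularized within-class matrix: W = alpha*I + (1 - alpha)*(S1 + S0)
    w = [[(1 - alpha) * (s1[a][b] + s0[a][b]) + (alpha if a == b else 0.0)
          for b in range(2)] for a in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    # p* = W^{-1} (mean1 - mean0), using the closed-form 2x2 inverse (Eq. 2)
    p = [(w[1][1] * delta[0] - w[0][1] * delta[1]) / det,
         (-w[1][0] * delta[0] + w[0][0] * delta[1]) / det]
    # J(p) = (p^T B p) / (p^T W p) with B = delta delta^T (Eq. 1)
    num = (p[0] * delta[0] + p[1] * delta[1]) ** 2
    den = sum(p[a] * w[a][b] * p[b] for a in range(2) for b in range(2))
    return num / den, p

# Toy data (hypothetical): two unit squares separated along the first axis.
class1 = [(2, 0), (2, 1), (3, 0), (3, 1)]
class0 = [(0, 0), (0, 1), (1, 0), (1, 1)]
j_star, p_star = fisher_j_star(class1, class0)
```

For this toy data the optimal direction lies along the first axis, which is the axis that separates the two classes.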



3.2 Discriminability for two tasks 

Let us now consider a set of data $\{x_i\}$ where each $x_i \in \mathbb{R}^d$, as above. However,
now we are interested in two sets of class labels for two different binary labeling tasks A and B.
For instance, the $\{x_i\}$ might represent a set of face images, and task A might correspond to
whether $x_i$ is a smiling face or not, whereas task B might represent whether the face in $x_i$ is
male or female. Instead of $X_0$ and $X_1$, we define $X_{0a}$ and $X_{1a}$ to represent the data
points $\{x_i\}$ that are labeled as class 0 or 1, respectively, for task A; we define $X_{0b}$ and
$X_{1b}$ analogously for task B. Then, the Fisher maximal discriminability for Task A is
$J^*(X_{1a}, X_{0a})$ and for Task B is $J^*(X_{1b}, X_{0b})$.



3.3 Finding filter θ to ensure high J* for Task A, low J* for Task B

Now, suppose that we filter each $x_i$ using any filter function $F(\theta, \cdot)$ that is
differentiable in $\theta$. By varying $\theta$, we can change the Fisher maximal discriminability
$J^*$ for both tasks.^1 Two useful filtering operations include (1) convolution:
$F(\theta, x) = x * \theta$, where $\theta$ represents the convolution kernel in vector form; and
(2) pixel-wise "masking": $F(\theta, x) = \mathrm{Diag}(\theta)\, x$, where $\mathrm{Diag}(\theta)$
represents a diagonal matrix formed from the vector $\theta$. In this case, $\theta$ represents a
"mask" placed over the image that allows the original image's pixels to pass through with varying
strength. Let us define $f_i(\theta) = F(\theta, x_i)$ as the output of the filter $F$ on $x_i$, and
let us define $F_{0a}(\theta), F_{1a}(\theta), F_{0b}(\theta), F_{1b}(\theta)$ analogously to their
(unfiltered) counterparts $X_{0a}, X_{1a}, X_{0b}, X_{1b}$.

^1 $J^*$ can be affected by filter $\theta$ even when the linear transformation that the filter
induces is invertible. In contrast, linear separability (existence/non-existence of a separating
hyperplane) cannot be affected by any invertible linear transformation; see Supp. Materials for a
proof.

Goal: We wish to find the filter parameter vector $\theta$ that gives high Fisher maximal
discriminability ($J^*$) to Task A, while simultaneously giving low Fisher maximal discriminability
to Task B. This can be formulated as an optimization problem over $\theta$ in several different
ways; we choose the following "ratio of discriminabilities" metric $R(\theta)$:



(4) 



where $\beta > 0$ is a scalar regularization parameter on $\theta$. Since all of the $F(\cdot)$ are
differentiable functions of $\theta$, and since $J^*(\cdot, \cdot)$ is given by the simple formulas
in Equations 3 and 1, we can use gradient descent to locally minimize the objective function in
Equation 4 w.r.t. $\theta$. The derivative expressions are given in the Supplementary Materials.
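
To make the effect of a filter on the two discriminabilities concrete, the sketch below applies a pixel-wise mask to toy 2-D data and compares J* for two tasks before and after masking. It does not reproduce Equation 4: as a stand-in, it evaluates the plain ratio J*_B / J*_A with no β term, and that choice is an assumption of this example. The helper names (`j_star`, `apply_mask`, `split`, `ratio_b_over_a`) and the grid data are likewise hypothetical. The closed form used inside `j_star` follows from substituting Equation 2 into Equation 1.

```python
# Sketch: how a mask F(theta, x) = Diag(theta) x changes J* for two tasks.
# Assumption: the quantity compared is the plain ratio J*_B / J*_A (no beta term).

def j_star(x1, x0, alpha=0.1):
    """Fisher maximal discriminability for 2-D points, with regularized W."""
    def mean(pts):
        return [sum(p[k] for p in pts) / len(pts) for k in range(2)]
    m1, m0 = mean(x1), mean(x0)
    w = [[alpha, 0.0], [0.0, alpha]]          # start from alpha * I
    for pts, mu in ((x1, m1), (x0, m0)):
        for p in pts:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for a in range(2):
                for b in range(2):
                    w[a][b] += (1 - alpha) * d[a] * d[b]
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    # With p* = W^{-1} delta, J* reduces to delta^T W^{-1} delta.
    return (w[1][1] * d0 * d0 - (w[0][1] + w[1][0]) * d0 * d1
            + w[0][0] * d1 * d1) / det

def apply_mask(theta, pts):
    # Element-wise mask: each coordinate is scaled by its own theta entry.
    return [(theta[0] * p[0], theta[1] * p[1]) for p in pts]

def split(pts, labels):
    x1 = [p for p, y in zip(pts, labels) if y == 1]
    x0 = [p for p, y in zip(pts, labels) if y == 0]
    return x1, x0

def ratio_b_over_a(theta, pts, labels_a, labels_b, alpha=0.1):
    f = apply_mask(theta, pts)
    ja = j_star(*split(f, labels_a), alpha=alpha)
    jb = j_star(*split(f, labels_b), alpha=alpha)
    return jb / ja

# Toy 4x4 grid: Task A depends on the first coordinate, Task B on the second.
points = [(i, j) for i in range(4) for j in range(4)]
labels_a = [1 if i >= 2 else 0 for i, j in points]
labels_b = [1 if j >= 2 else 0 for i, j in points]
r_identity = ratio_b_over_a((1.0, 1.0), points, labels_a, labels_b)
r_masked = ratio_b_over_a((1.0, 0.1), points, labels_a, labels_b)
```

Shrinking the second coordinate lowers the ratio because W is regularized. With α = 0, J* would be unchanged by any invertible linear filter, including this mask.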



3.4 Reconstruction from filtered images 

The gradient descent procedure described above will find a $\theta$ that locally minimizes
$R(\theta)$, but there is no guarantee that the filtered images $F$ will visually resemble the
original images $X$ or that humans can interpret them. For machine classification (e.g., when
learning a filter to improve inter-dataset performance), this may not matter, but for human labeling
applications, it may be necessary to "restore" the filtered images to a more intuitive form. Hence,
as an optional step, linear ridge regression can be used to convert the filtered images to a form
more closely resembling the original images $X$, while still preserving the property that they are
highly discriminable for Task A and not highly discriminable for Task B. In particular, we can
compute the $d \times (d + 1)$ (the extra +1 is for the bias term) linear transformation $P$ that
minimizes



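$$\left\| X - P \begin{bmatrix} F \\ \mathbf{1}^\top \end{bmatrix} \right\|_{Fr}^2 + \gamma \left\| P I \right\|_{Fr}^2$$

where $\gamma > 0$ is a scalar ridge strength parameter, $I$ is the identity matrix except that the
last ($(d+1)$th) diagonal entry is 0 instead of 1 (so that there is no regularization on the bias
weight), and $Fr$ means Frobenius norm.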

The ridge term in the linear reconstruction is critical: because many of the filters that the
gradient descent procedure learns correspond to invertible linear transformations, linear regression
without regularization would transform each $f_i$ back to $x_i$ with no loss of information, which
would defeat the purpose of filtering at all. With ridge regression, on the other hand, only the
"more discernible" aspects of the image (i.e., the task-of-interest) are restored clearly, while the
"less discernible" aspects (pertaining to the distractor task) are not. By varying $\gamma$, one can
cause each "reconstructed" image $g_i$ (where
$g_i = P \begin{bmatrix} f_i \\ 1 \end{bmatrix}$) to strongly resemble the mean image $\bar{x}$ (for
large $\gamma$) or to strongly resemble its unfiltered counterpart $x_i$ (for small $\gamma$). In
practice, $\gamma$ is chosen based on visual inspection of the reconstructed training images so
that, to the human observer, the task-of-interest is clearly discriminable while the distractor task
is not.
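
For readers who want to compute $P$ directly, the ridge objective above has a closed-form minimizer. The derivation below is a sketch under two assumptions: both norms are squared Frobenius norms, and $\tilde{F}$ (a symbol introduced here, not in the original text) denotes the filtered data matrix with a row of ones appended. It also uses the fact that $I I^\top = I$ for the modified identity, which is diagonal with entries 0 or 1.

```latex
\tilde{F} = \begin{bmatrix} F \\ \mathbf{1}^\top \end{bmatrix}, \qquad
\frac{\partial}{\partial P}\Big( \| X - P\tilde{F} \|_{Fr}^2 + \gamma \| P I \|_{Fr}^2 \Big)
  = -2\,(X - P\tilde{F})\,\tilde{F}^\top + 2\gamma\, P I = 0
\;\;\Longrightarrow\;\;
P = X \tilde{F}^\top \big( \tilde{F}\tilde{F}^\top + \gamma I \big)^{-1}
```

For $\gamma > 0$ the matrix $\tilde{F}\tilde{F}^\top + \gamma I$ is positive definite, so the inverse exists. As $\gamma$ grows, the first $d$ columns of $P$ shrink toward zero and the bias column tends to the mean image, consistent with the behavior described above.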



4 Experiment I: synthetic data 

In our first experiment we studied whether the proposed algorithm could operate on images (16 x 16
pixels) consisting of simple line patterns in order to suppress lines in one direction while
preserving them in another. For the filtering operation, we chose to learn a convolution kernel of
5 x 5 pixels. In this study, all images contained one horizontal line and one vertical line at
random locations. In Task A, an image was labeled 0 if it contained a vertical line in the left half
of the image, and it was labeled 1 if its vertical line was in the right half. In Task B, an image
was labeled 0 if its horizontal line was in the top half, and labeled 1 if it was in the bottom
half. Each image $x_i \in \mathbb{R}^{16 \times 16}$ was generated by adding one vertical and one
horizontal line (of pixel intensity 1) at random image positions, and then adding uniform noise in
$U[0, 0.5)$ to all pixels in the image. Example images are shown in Figure 2 (left).
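
The data-generation procedure described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function name `make_image`, the fixed random seed, and the choice to give the pixel where the two lines cross intensity 1 (rather than 2) before noise is added are assumptions.

```python
import random

def make_image(rng, size=16, noise_max=0.5):
    """Return (image, label_a, label_b) for one synthetic line image."""
    col = rng.randrange(size)   # position of the vertical line
    row = rng.randrange(size)   # position of the horizontal line
    img = [[0.0] * size for _ in range(size)]
    for k in range(size):
        img[k][col] = 1.0       # vertical line of intensity 1
        img[row][k] = 1.0       # horizontal line of intensity 1
    # Add uniform noise drawn from [0, noise_max) to every pixel.
    img = [[v + rng.random() * noise_max for v in r] for r in img]
    label_a = 0 if col < size // 2 else 1   # Task A: vertical line left / right
    label_b = 0 if row < size // 2 else 1   # Task B: horizontal line top / bottom
    return img, label_a, label_b

rng = random.Random(0)          # fixed seed so the sketch is repeatable
dataset = [make_image(rng) for _ in range(1000)]
```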
Figure 2: Left: Synthetic images consisting of vertical and horizontal lines at different positions.
Center: gradient descent curve over $R(\theta)$ to learn a filter to preserve Task A and suppress Task B.
Right: The filters learned at corresponding gradient descent steps.



Figure 3: Top (unfiltered patches $\{x_i\}$): unfiltered image patches consisting of superimposed
vertical and horizontal lines plus uniform noise. Bottom (filtered patches $\{f_i\}$): the same
images filtered with a convolution kernel designed to suppress discriminability of Task B
(horizontal lines) while preserving discriminability of Task A (vertical lines).



After generating 1000 images according to the procedure above, we initialized the convolution kernel
$\theta \in \mathbb{R}^{5 \times 5}$ to random values from $U[0, 1)$ (shown in Figure 2 as the filter
kernel at gradient descent step 0) and then applied the algorithm above to learn a filter to
preserve Task A while suppressing Task B. We set $\beta$ to 0.5. The descent curve is shown in
Figure 2 (center), and the learned filter kernel at every 10 steps is shown below the graph.

After filtering the images using the convolution kernel learned after 50 descent steps, we arrived 
at the images shown in Figure 3. Notice how the horizontal lines have been almost completely
eradicated, thus decreasing class discriminability for Task B. 



5 Experiment II: natural face images 
5.1 Preserve expression, suppress gender 

We applied the proposed filter learning method to natural face images from the GENKI dataset [12],
which consists of 60,000 images that have been manually labeled for 2 binary attributes -
smile/non-smile and male/female - as well as the 2D positions of the eyes, nose, and mouth, and the
3D head pose (yaw, pitch, and roll). In this experiment we assessed whether a filter could be learned
to preserve discriminability of expression (smile/non-smile), while suppressing discriminability of
gender. We used a pixel-wise "mask" filter (see Section 3) of the same size as the images (16 x 16
pixels).

From the whole GENKI dataset we selected a training set consisting of 1740 images (50% male and 
50% female; 50% smile and 50% non-smile) whose yaw, pitch, and roll parameters were all within 
5 deg of frontal. All of the images were registered to a common face cropping using the center of 
the eyes and mouth as anchor points. They were then downscaled to a resolution of 16 x 16 pixels. 
In addition, we similarly extracted a separate testing set consisting of 100 images (50 males, 50 
females, and 50 smiling, 50 non-smiling) with the same 3D pose characteristics. The filter $\theta$ was
initialized component-wise by sampling from $U[0, 1)$.

Using the training set for learning the filter, and setting the regularization parameter
$\beta = 0.5$ ($\alpha = 0.1$ as always), we applied conjugate gradient descent for 100 function
evaluations. The learned filter was then applied to all of the training images. Finally, we applied
the image reconstruction technique described in Section 3.4 to restore the filtered images to a form
more easily analyzable by humans. The reconstruction ridge parameter $\gamma$ was selected, by
looking only at the training images, so that smile appeared well discriminable whereas gender did
not (in this case, $\gamma = 6 \times 10^{-2}$). Examples of the input images as well as the filtered
(+ reconstructed) images are shown in Figure 4 (left). The learned filter mask is shown to the right
of the text "Learned filter". As shown in the figure, most of the smile information in the filtered
images is preserved, and while gender may still be partially discernible, much of the gender
information has been suppressed by the filter.

Figure 4: Face images from the GENKI dataset that have been filtered to preserve expression and
suppress gender (left), or to preserve gender and suppress expression (right). Filters were learned
using the algorithm presented in Section 3. Learned filter masks are shown next to "Learned filter:".

To assess quantitatively the ability of the learned filter to preserve expression and suppress
gender, we posted a labeling task to the Amazon Mechanical Turk consisting of 50 randomly selected
pairs of filtered images selected from the testing set using the filter learned according to the
above procedure. Each pair contained 1 smiling image and 1 non-smiling image presented in random
order (Left or Right), and the labeler was asked to select which image - Left or Right - was
"smiling more". The entire set of 50 image pairs was presented to 10 Mechanical Turk workers, and
their opinions on each pair were combined using Majority Vote,^2 with ties resolved by selecting the
"Right" image. Accuracy of the Mechanical Turk labelers compared to the official GENKI labels was
measured as the probability of correctness on a 2 alternative forced choice task (2AFC), which is
equivalent under mild conditions to the Area under the Receiver Operating Characteristics curve (A'
statistic) that is commonly used in the automatic facial expression recognition literature (e.g.,
[8]). We similarly generated a set of 50 randomly selected pairs of filtered images containing 1
male and 1 female. As a baseline, we compared gender and smile labeling accuracy of the filtered
images to similar tasks for the unfiltered images. Results are shown in Table 1.
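
The vote-combination and scoring procedure described above can be sketched in a few lines. The function names (`majority_vote`, `two_afc_accuracy`), the 'L'/'R' encoding, and the toy votes are assumptions for illustration; they are not taken from the study's data.

```python
def majority_vote(votes):
    """Combine one pair's 'L'/'R' votes; ties are resolved as 'R'."""
    left = sum(1 for v in votes if v == "L")
    right = len(votes) - left
    return "L" if left > right else "R"

def two_afc_accuracy(all_votes, truth):
    """Fraction of pairs whose combined vote matches the true side."""
    combined = [majority_vote(v) for v in all_votes]
    correct = sum(1 for c, t in zip(combined, truth) if c == t)
    return correct / len(truth)

# Toy example: 3 image pairs, 4 hypothetical workers each.
votes = [["L", "L", "R", "L"],   # clear majority for Left
         ["L", "R", "L", "R"],   # tie, resolved as Right
         ["R", "R", "R", "L"]]   # clear majority for Right
truth = ["L", "L", "R"]
acc = two_afc_accuracy(votes, truth)
```

In this toy example the tie on the second pair is resolved as "Right", which disagrees with the true label, so two of the three pairs are scored as correct.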

As shown in the table, the learned image filter substantially reduced discriminability of gender (from 
98% to 58%), while maintaining high discriminability of expression (94% to 96%) compared to the 
baseline (unfiltered) images. 

Comparison to a manually constructed filter: In the case of expression and gender attributes,
one might reasonably argue that the "optimal filter" for preserving smile/non-smile and suppressing
male/female information would be simply to crop and display only the mouth region of each face.
Hence, we performed an additional experiment in which we compared Mechanical Turk labeling
accuracy on 50 pairs of filtered images, generated similarly as described above, using a manually
constructed mask filter consisting of just the mouth region (rows 11 through 15 and columns 4
through 13 of each 16 x 16 face image). Results are in Table 1: while smile discriminability was as
high as with learned filter 1, gender discriminability using the manually constructed filter was
substantially higher (74% compared to 58%), indicating that the manually constructed filter actually
allowed considerable gender information to pass through. This suggests that a learned filter can
work better than a manually constructed one even when strong prior domain knowledge exists.

^2 We also applied an algorithm for optimal integration of crowdsourced labels [13]; see Supp. Materials.

Table 1: Accuracy (2AFC) of workers on Mechanical Turk when labeling filtered GENKI images

Filter method                                             Expression   Gender
Unfiltered (baseline)                                     94%          98%
Learned filter 1: Preserve expression, suppress gender    96%          58%
Manually constructed filter: show mouth region only       96%          74%
Learned filter 2: Preserve gender, suppress expression    64%          86%

[Figure 5 panel text: "Whose face is this? Match the filtered face image above to its unfiltered
image below." Candidate faces are labeled (a) through (j).]

Figure 5: The preserve-smile, suppress-gender filter both allows smile/non-smile information
to pass through, and also serves as a "face de-identification" mechanism, as illustrated in the face
recognition task above. The correct face match is (f).

5.2 Preserve gender, suppress expression 

Analogously to Section 5.1, we also learned a filter to preserve gender and suppress expression,
using an identical training procedure to that described above. Examples of the filtered
(+ reconstructed) images ($\gamma$ = 9 x 10^'^) are shown in Figure 4 (right). Note how, for face
image (b), the filter not only "suppressed" the expression of the non-smiling female, but actually
seems to "flip" the smile/non-smile label so that the woman appears to be smiling. The accuracy
compared to baseline (unfiltered) images is shown in Table 1. While accuracy of gender labeling did
drop from 98% to 86%, accuracy of smile labeling dropped much more (94% to 64%) compared to
unfiltered images.

6 Experiment III: Preserving privacy in face images (face de-identification) 

The filters learned in Section 5 to preserve smile while suppressing gender information were not
designed specifically to suppress the faces' identity. In practice, however, we found that the identity
of the people shown was very difficult to discern in the filtered images. Indeed, it is possible that
gender represents one of the first "principal components" of face space, and that, by removing 
gender, one implicitly removes substantial identity information as well. 

To test the hypothesis that identity was effectively masked by suppressing gender, we created a face 
recognition test consisting of 40 questions similar to Figure 5: a single face must be matched to one
of 10 unfiltered candidate face images. In half of the questions, the face to be matched was filtered
using the preserve-expression, suppress-gender filter (Section 5). In this case, the matching task
was very challenging. In the other half of the questions, the face to be matched was unfiltered, and 
hence the matching task was nearly trivial. The order of the questions presented to the labelers was 
randomized, and we obtained results from 10 workers on the Amazon Mechanical Turk. 
Results: For the unfiltered images, the rate of successful match was 100% for each of the 10 labelers.
For the filtered images, the rate of successful match, using Majority Vote, was 15%, indicating that
the preserve-smile, suppress-gender filter also removed identity. The highest successful matching
rate of the filtered images for any one labeler was 30%. Baseline rate for guessing was 10%.

7 Experiment IV: Filtering to improve generalization across datasets 

Here we provide a proof-of-concept of learning a filter that improves generalization to novel datasets.
Consider a dataset of face images, such as GENKI, with a positive correlation between gender and 
smile. If a male/female classifier were trained on these data, then it might learn to distinguish gender 
not just by male/female information alone, but also by the correlated presence of smile. When tested 
on a different dataset with a different covariance structure, e.g., with negative correlation between 
smile and gender, the classifier would likely perform badly. If we first filter the data to suppress 
smile information but preserve gender information, then the trained classifier might not suffer when 
applied to the new dataset. 

To test this hypothesis, we partitioned the GENKI images used in Section 5 into a training set (4062
images) and a testing set (970 images). As before, all images were 16 x 16 pixels. In the training
set, the correlation between smile and gender was +0.64, whereas in the testing set, it was -1.
We then trained two support vector machine classifiers with radial basis function (RBF) kernels to 
classify gender. One classifier was trained on filtered training images, using the gender-preservation, 
smile-suppression filter learned in Section 5, and the other was trained on unfiltered images. The
RBF width $\gamma$ was optimized independently ($\gamma \in$ {10^^, 10^*^, . . . , 10+^}) for each classifier using
a "holdout" set (a randomly selected 20% subset of the training images). The classifier trained on 
unfiltered images was then applied to the unfiltered testing set, and the classifier trained on filtered 
images was applied to the filtered testing set. 
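
The correlation figures quoted above are between two binary label vectors. The sketch below shows one way such a figure could be computed, assuming the Pearson correlation is meant (the text does not name the statistic). The function name `pearson` and the toy labels are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length numeric label lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

# Toy labels: every smiling face (1) has gender label 0 and vice versa,
# which is the fully anti-correlated situation of the testing set.
smile = [1, 1, 0, 0, 1, 0]
gender = [0, 0, 1, 1, 0, 1]
corr = pearson(smile, gender)
```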

Results: Filtering the images using the gender-preservation, smile-suppression filter resulted in 
substantially increased generalization performance: 2AFC accuracy was 0.92 for the SVM trained 
on filtered images, whereas it was only 0.79 for the SVM trained on unfiltered images. 

8 Related work 

We are unaware of any work that specifically learns filters to simultaneously preserve and suppress 
different image attributes. However, the approach taken in this paper is somewhat reminiscent of 
work by Birdwell and Horn [2], in which an optimal combination of a fixed set of filters is learned
to minimize the conditional entropy of class labels given the filtered inputs. 

In terms of applications to data privacy, our method is related to "face de-identification" methods 
such as [9, 6, 5]. Such methods identify faces which are similar either in terms of pixel space ([9, 5]),
eigenface space ([9]), or Active Appearance Model parameters ([6]), and then replace clusters of k
similar faces with their mean face, thus guaranteeing that no face can be identified more specifically 
than to a cluster of k candidates. However, in contrast to our proposed algorithm, these methods 
cannot be "reversed" to maximally preserve identity while minimizing discriminability of a given 
face attribute. 

For the application of generalizing to datasets with different image statistics, our work is related to 
the problem of covariate shift [11] and the field of transfer learning [10]. The method proposed in
our paper is useful when dataset differences are known a priori - the learned filter helps to overcome 
covariate shift by altering the underlying images themselves. 

9 Summary 

We have presented a novel method for learning filters that can preserve binary discriminability for the 
task-of-interest, while suppressing discriminability for a distractor task. The effectiveness of the ap- 
proach was demonstrated on synthetic as well as natural face images. Interestingly, the suppression 
of gender implicitly removed considerable facial identity information, which renders the technique 
useful for labeling tasks where personal identity should remain private. Finally, we demonstrated 
that "discriminately decreasing discriminability" may help classifiers to generalize across datasets.
Supplementary Materials

9.1 Proof: Class Separability Unaffected by Invertible Linear Transformation 

Let sets $X_1$ and $X_0$ contain the data points (column vectors) for classes 1 and 0, respectively.
Let $T$ be an arbitrary invertible linear transformation. We define separability of $X_1$ and $X_0$
to mean that there exists a hyperplane, with normal vector $w$, such that
$w^\top x_i > w^\top x_j$ for every $x_i \in X_0$ and every $x_j \in X_1$. Claim:

$$\exists w : w^\top x_i > w^\top x_j \iff \exists \tilde{w} : \tilde{w}^\top (T x_i) > \tilde{w}^\top (T x_j)$$

Proof of $\Rightarrow$: There exists $w$ such that $w^\top x_i > w^\top x_j$ for every
$x_i \in X_0$ and every $x_j \in X_1$. Since $T$ is invertible, the matrix $T^{-1}$ exists. Hence,
$w^\top T^{-1} T x_i > w^\top T^{-1} T x_j$. We can define $\tilde{w}^\top = w^\top T^{-1}$. Then,
$\tilde{w}^\top (T x_i) > \tilde{w}^\top (T x_j)$ for every $x_i \in X_0$ and every $x_j \in X_1$.

Proof of $\Leftarrow$: There exists $\tilde{w}$ such that
$\tilde{w}^\top (T x_i) > \tilde{w}^\top (T x_j)$ for every $x_i \in X_0$ and every $x_j \in X_1$.
Then we can define $w^\top = \tilde{w}^\top T$, and we have $w^\top x_i > w^\top x_j$ for every
$x_i \in X_0$ and every $x_j \in X_1$.



9.2 Experiment II: natural face images - supplementary results 

For the natural face image labeling tasks, in addition to combining opinions of the 10 workers on 
Mechanical Turk using Majority Vote, we also tried using a recently developed method by Whitehill
et al. [13] for combining multiple opinions when the quality of the labelers is unknown a priori. Their
algorithm is called GLAD (Generative model of Labels, Abilities, and Difficulties) and its source 
code is available online. 

In general the results were similar to Majority Vote:

Accuracy (2AFC) of labels from Amazon Mechanical Turk using GLAD [13] to combine opinions

Filter method                                             Expression   Gender
Unfiltered (baseline)                                     94%          98%
Learned filter 1: Preserve expression, suppress gender    94%          54%
Manually constructed filter (show mouth region only)      98%          72%
Learned filter 2: Preserve gender, suppress expression    64%          92%



9.3 Derivative expressions for gradient descent

Let $\theta_j$ represent the $j$th component of vector $\theta$. We abbreviate $p^*(F_1, F_0)$ as
$p^*$. Let us also abbreviate each matrix $F(\theta)$ as simply $F$. Finally, let
$\bar{f}_0(\theta)$ and $\bar{f}_1(\theta)$ represent the mean filtered vectors (with filter
$\theta$) for class 0 and 1, corresponding to $\bar{x}_0$ and $\bar{x}_1$, respectively.

To compute $\frac{\partial R}{\partial \theta_j}$ we can apply the chain rule several times in
succession. Most of the derivatives are relatively straightforward to derive using standard formulas
from linear algebra; however, we present derivatives for the most important terms:

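$$\frac{\frac{\partial}{\partial \theta_j} J(F_{1a}, F_{0a}, p^*(F_{1a}, F_{0a}))}{J(F_{1a}, F_{0a}, p^*(F_{1a}, F_{0a}))}, \qquad \frac{\frac{\partial}{\partial \theta_j} J(F_{1b}, F_{0b}, p^*(F_{1b}, F_{0b}))}{J(F_{1b}, F_{0b}, p^*(F_{1b}, F_{0b}))}$$

$$\frac{\partial}{\partial \theta_j} J(F_1, F_0, p^*) = \frac{1}{p^{*\top} W p^*} \frac{\partial}{\partial \theta_j}\left( p^{*\top} B p^* \right) - \frac{p^{*\top} B p^*}{\left( p^{*\top} W p^* \right)^2} \frac{\partial}{\partial \theta_j}\left( p^{*\top} W p^* \right)$$

$$\frac{\partial \bar{f}_k(\theta)}{\partial \theta_j} = \frac{1}{N_k} \sum_{i : y_i = k} \frac{\partial f_i(\theta)}{\partial \theta_j}$$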

The derivatives $\frac{\partial}{\partial \theta_j} f_i(\theta)$ depend on the particular kind of
filter. In the subsections below we find the derivatives for a convolution filter, and a pixel-wise
"mask" filter.

Derivatives of linear convolution filter 

For the case of convolving two 1-D functions $f$ and $g$ whose domains are both $\mathbb{R}$ (all
real numbers), differentiating the convolution operator is trivial:
$\frac{\partial}{\partial g}(f * g) = f$. However, in our case we are interested in finite, discrete
convolution of a convolution kernel and an image. Consider for the moment the case of 1-D
convolution of a 3-element kernel $\theta$ with a 3-element image $x$:

$$\theta = [\, a \;\; b \;\; c \,] \tag{5}$$
$$x = [\, r \;\; s \;\; t \,] \tag{6}$$
$$\theta * x = [\, ar \;\;\; as + br \;\;\; at + bs + cr \;\;\; bt + cs \;\;\; ct \,] \tag{7}$$

For the purposes of gradient descent, it is necessary to "clip" the convolution operation's output so 
that it retains the same size as the input x; hence, we define: 

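$$\mathrm{Clip}(\theta * x) = [\, as + br \;\;\; at + bs + cr \;\;\; bt + cs \,] \tag{9}$$

We can now differentiate $\mathrm{Clip}(\theta * x)$ w.r.t. each dimension of $\theta$:

$$\frac{\partial}{\partial a} \mathrm{Clip}(\theta * x) = [\, s \;\; t \;\; 0 \,] \tag{11}$$
$$\frac{\partial}{\partial b} \mathrm{Clip}(\theta * x) = [\, r \;\; s \;\; t \,] \tag{12}$$
$$\frac{\partial}{\partial c} \mathrm{Clip}(\theta * x) = [\, 0 \;\; r \;\; s \,] \tag{13}$$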

Hence, the derivatives of the clipped convolution are computed by "sliding" the row vector $x$ across
a row of 0's and clipping at the appropriate indices. For the case of finite discrete 2-D convolution,
the situation is analogous - the gradient of the 2-D convolution $\theta * x$ is found by "sliding"
the image matrix $x$ over a matrix of 0's.
While it is possible to specify in mathematical notation the exact indices used for sliding and
clipping, it is tedious and unilluminating. Instead we provide a snippet of Matlab code for computing
$\frac{\partial}{\partial \theta_i} \mathrm{Clip}(\theta * x)$ for 2-D image convolution with
0-padding (no wrap-around):

function dConvdtheta = computedConvdtheta(x, theta, i)
% dConvdtheta = COMPUTEDCONVDTHETA(x, theta, i)
% computes the derivative with respect to theta_i
% of conv(x, theta).

m = sqrt(length(theta));  % theta corresponds to m x m convolution kernel
n = sqrt(length(x));      % x corresponds to n x n image

% Locate the (row, column) position of the i-th kernel element
idxs = reshape(1:length(theta), [m m]);
[r, c] = find(idxs == i);

% Place the image with its top-left corner at (r, c) inside a zero matrix
% of the full convolution size
xim = reshape(x, [n n]);
dConvdthetaim = zeros(m+n-1, m+n-1);
dConvdthetaim(r:r+n-1, c:c+n-1) = xim;

% Trim the matrix back down to only be the "center" part of the convolution result
upperPadding = ceil((size(dConvdthetaim, 1) - n) / 2);
leftPadding = ceil((size(dConvdthetaim, 2) - n) / 2);
lowerPadding = size(dConvdthetaim, 1) - n - upperPadding;
rightPadding = size(dConvdthetaim, 2) - n - leftPadding;
dConvdthetaim = dConvdthetaim(1+upperPadding:end-lowerPadding, ...
                              1+leftPadding:end-rightPadding);
dConvdtheta = dConvdthetaim(:);

end
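
The 1-D worked example (kernel [a b c], image [r s t]) can be checked numerically with a short sketch. It is written in Python rather than Matlab purely for illustration. The function names (`full_conv`, `clip_conv`, `d_clip_conv`) are hypothetical, and an odd kernel length is assumed so that the "center" of the full convolution is unambiguous.

```python
def full_conv(theta, x):
    """Full discrete 1-D convolution with zero padding (no wrap-around)."""
    out = [0.0] * (len(theta) + len(x) - 1)
    for i, t in enumerate(theta):
        for j, v in enumerate(x):
            out[i + j] += t * v
    return out

def clip_conv(theta, x):
    """Central len(x) entries of the full convolution (odd kernel length assumed)."""
    pad = (len(theta) - 1) // 2
    return full_conv(theta, x)[pad:pad + len(x)]

def d_clip_conv(x, kernel_len, i):
    """Derivative of clip_conv w.r.t. theta[i]: x slid by i over zeros, then clipped."""
    full = [0.0] * (kernel_len + len(x) - 1)
    full[i:i + len(x)] = [float(v) for v in x]
    pad = (kernel_len - 1) // 2
    return full[pad:pad + len(x)]

x = [2.0, 3.0, 5.0]                       # stands in for [r, s, t]
grads = [d_clip_conv(x, 3, i) for i in range(3)]
```

Because the clipped convolution is linear in the kernel, each derivative equals the clipped convolution of the image with a unit kernel. That gives an independent check on the "sliding" rule.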



Derivatives of element-wise "mask" filter

If we let the filter $F(\theta, x) = D(\theta)\, x$, where $D(\theta)$ is a diagonal matrix formed
from vector $\theta$, then

$$\frac{\partial}{\partial \theta_j} F(\theta, x) = I_j\, x$$

where $I_j$ consists of all 0's except the $(j,j)$th entry which is 1.